Abstract

When applying aggregating strategies to Prediction with Expert Advice,
the learning rate must be adaptively tuned. The natural choice of
$\sqrt{\text{complexity}/\text{current loss}}$ renders the analysis of Weighted Majority
derivatives quite complicated. In particular, for arbitrary weights there have been
no results proven so far. The analysis of the alternative "Follow the Perturbed
Leader" (FPL) algorithm from [KV03] (based on Hannan's algorithm)
is easier. We derive loss bounds for adaptive learning rate and both finite expert
classes with uniform weights and countable expert classes with arbitrary
weights. For the former setup, our loss bounds match the best known results
so far, while for the latter our results are (to our knowledge) new.

Keywords 

Prediction with Expert Advice, Follow the Perturbed Leader, general weights,
adaptive learning rate, hierarchy of experts, expected and high probability
bounds, general alphabet and loss, online sequential prediction.



*This work was supported by SNF grant 2100-67712.02. 






1 Introduction 



The theory of Prediction with Expert Advice (PEA) has rapidly developed in the 
recent past. Starting with the Weighted Majority (WM) algorithm of Littlestone 
and Warmuth [LW89, LW94] and the aggregating strategy of Vovk [Vov90], a vast 
variety of different algorithms and variants have been published. A key parameter 
in all these algorithms is the learning rate. While this parameter had to be fixed 
in the early algorithms such as WM, [CB97] established the so-called doubling trick 
to make the learning rate coarsely adaptive. A little later, incrementally adaptive 
algorithms were developed [AG00, ACBG02, YEYS04, Gen03]. Unfortunately, the
loss bound proofs for the incrementally adaptive WM variants are quite complex 
and technical, despite the typically simple and elegant proofs for a static learning 
rate. 

The increasingly complex proof techniques also had another consequence: While for
the original WM algorithm, assertions are proven for countable classes of experts 
with arbitrary weights, the modern variants usually restrict to finite classes with 
uniform weights (an exception being [Gen03], see the discussion section). This 
might be sufficient for many practical purposes but it prevents the application to 
more general classes of predictors. Examples are extrapolating (=predicting) data
points with the help of a polynomial (=expert) of degree $d=1,2,3,\ldots$, or the (from a
computational point of view largest) class of all computable predictors. Furthermore,
most authors have concentrated on predicting binary sequences, often with the 0/1
loss for {0,1}-valued and the absolute loss for [0,1]-valued predictions. Arbitrary
losses are less common. Nevertheless, it is easy to abstract completely from the 
predictions and consider the resulting losses only. Instead of predicting according 
to a "weighted majority" in each time step, one chooses one single expert with a 
probability depending on his past cumulated loss. This is done e.g. in [FS97], where 
an elegant WM variant, the Hedge algorithm, is analyzed. 

A very different, general approach to achieve similar results is "Follow the Per- 
turbed Leader" (FPL). The principle dates back to as early as 1957, now called 
Hannan's algorithm [Han57]. In 2003, Kalai and Vempala published a simpler proof
of the main result of Hannan and also succeeded in improving the bound by modifying
the distribution of the perturbation [KV03]. The resulting algorithm (which they
call FPL*) has the same performance guarantees as the WM-type algorithms for
fixed learning rate, save for a factor of $\sqrt{2}$. A major advantage we will discover in
this work is that its analysis remains easy for an adaptive learning rate, in contrast 
to the WM derivatives. Moreover, it generalizes to online decision problems other 
than PEA. 

In this work, we study the FPL algorithm for PEA. The problems of WM algo- 
rithms mentioned above are addressed: We consider countable expert classes with 
arbitrary weights, adaptive learning rate, and arbitrary losses. Regarding the adap- 
tive learning rate, we obtain proofs that are simpler and more elegant than for the 
corresponding WM algorithms. (In particular the proof for a self-confident choice of
the learning rate, Theorem 7, is less than half a page.) Further, we prove the first
loss bounds for arbitrary weights and adaptive learning rate. Our result even seems 
to be the first for equal weights and arbitrary losses, however the proof technique 
from [ACBG02] is likely to carry over to this case. 

This paper is structured as follows. In Section 2 we give the basic definitions. 
Sections 3 and 4 derive the main analysis tools, following the lines of [KV03], but 
with some important extensions. They are applied in order to prove various upper 
bounds in Section 5. Section 6 proposes a hierarchical procedure to improve the
bounds for non-uniform weights. Section 7 establishes a lower bound and treats
some additional issues. Finally, in Section 8 we discuss our results, compare
them to references, and state some open problems.

2 Setup & Notation 

Setup. Prediction with Expert Advice proceeds as follows. We are asked to perform
sequential predictions $y_t\in\mathcal{Y}$ at times $t=1,2,\ldots$. At each time step $t$, we have access
to the predictions $(y_t^i)_{1\le i\le n}$ of $n$ experts $\{e_1,\ldots,e_n\}$. After having made a prediction,
we make some observation $x_t\in\mathcal{X}$, and a Loss is revealed for our and each expert's
prediction. (E.g. the loss might be 1 if the expert made an erroneous prediction and
0 otherwise. This is the 0/1-loss.) Our goal is to achieve a total loss "not much
worse" than the best expert, after $t$ time steps.

We admit $n\in\mathbb{N}\cup\{\infty\}$ experts, each of which is assigned a known complexity
$k^i\ge 0$. Usually we require $\sum_i e^{-k^i}\le 1$, for instance $k^i=2\ln(i+1)$. Each complexity
defines a weight by means of $e^{-k^i}$ and vice versa. In the following we will talk
rather of complexities than of weights. If $n$ is finite, then usually one sets $k^i=\ln n$
for all $i$; this is the case of uniform complexities/weights. If the set of experts is
countably infinite ($n=\infty$), uniform complexities are not possible. The vector of
all complexities is denoted by $k=(k^i)_{1\le i\le n}$. At each time $t$, each expert $i$ suffers
a loss$^1$ $s_t^i=\mathrm{Loss}(x_t,y_t^i)\in[0,1]$, and $s_t=(s_t^i)_{1\le i\le n}$ is the vector of all losses at time $t$.
Let $s_{<t}=s_1+\ldots+s_{t-1}$ (respectively $s_{1:t}=s_1+\ldots+s_t$) be the total past loss vector
(including current loss $s_t$) and $s_{1:t}^{min}=\min_i\{s_{1:t}^i\}$ be the loss of the best expert in
hindsight (BEH). Usually we do not know in advance the time $t\ge 0$ at which the
performance of our predictions is evaluated.

General decision spaces. The setup can be generalized as follows. Let $\mathcal{S}\subseteq\mathbb{R}^n$ be
the state space and $\mathcal{D}\subseteq\mathbb{R}^n$ the decision space. At time $t$ the state is $s_t\in\mathcal{S}$, and a
decision $d_t\in\mathcal{D}$ (which is made before the state is revealed) incurs a loss $d_t\circ s_t$, where
"$\circ$" denotes the inner product. This implies that the loss function is linear in the
states. Conversely, each linear loss function can be represented in this way. The
decision which minimizes the loss in state $s\in\mathcal{S}$ is

$$M(s) := \arg\min_{d\in\mathcal{D}}\{d\circ s\} \qquad (1)$$

if the minimum exists. The application of this general framework to PEA is straightforward:
$\mathcal{D}$ is identified with the space of all unit vectors $\mathcal{E}=\{e_i : 1\le i\le n\}$, since a
decision consists of selecting a single expert, and $s_t\in[0,1]^n$, so states are identified
with losses. Only Theorem 2 will be stated in terms of general decision space, where
we require that all minima are attained.$^2$ Our main focus is $\mathcal{D}=\mathcal{E}$. However, all our
results generalize to the simplex $\mathcal{D}=\Delta=\{v\in[0,1]^n : \sum_i v^i=1\}$, since the minimum
of a linear function on $\Delta$ is always attained on $\mathcal{E}$.

$^1$ The setup, analysis and results easily scale to $s_t^i\in[0,S]$ for $S>0$ other than 1.

Follow the Perturbed Leader. Given $s_{<t}$ at time $t$, an immediate idea to solve
the expert problem is to "Follow the Leader" (FL), i.e. selecting the expert $e_i$ which
performed best in the past (minimizes $s_{<t}^i$), that is predict according to expert
$M(s_{<t})$. This approach fails for two reasons. First, for $n=\infty$ the minimum in
(1) may not exist. Second, for $n=2$ and
$s=\begin{pmatrix}0&1&0&1&0&1&\cdots\\ \frac12&0&1&0&1&0&\cdots\end{pmatrix}$, FL always chooses the
wrong prediction [KV03]. We solve the first problem by penalizing each expert by
its complexity, i.e. predicting according to expert $M(s_{<t}+k)$. The FPL (Follow the
Perturbed Leader) approach solves the second problem by adding to each expert's
loss $s_{<t}^i$ a random perturbation. We choose this perturbation to be negative
exponentially distributed, either independent in each time step or once and for all at the
very beginning at time $t=0$. These two possibilities are equivalent with respect to
expected losses, since the expectation is linear. The former choice is preferable in
order to protect against an adaptive adversary who generates the $s_t$, and in order
to get bounds with high probability (Section 7). For the main analysis however, the
latter is more convenient. So henceforth we can assume without loss of generality
one initial perturbation $q$ when dealing with expected loss.
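
The second failure mode of FL can be checked directly. The following is a minimal sketch, assuming the alternating two-expert loss sequence in which expert 1 suffers 0,1,0,1,... and expert 2 suffers 1/2,0,1,0,... (our reading of the [KV03] example); the function name `follow_the_leader` and the tie-breaking rule (lowest index wins) are illustrative choices, not part of the paper.

```python
def follow_the_leader(losses):
    """Run Follow the Leader; losses[t][i] is the loss of expert i at step t."""
    n = len(losses[0])
    cumulative = [0.0] * n          # s_{<t}: past cumulative loss per expert
    total = 0.0
    for step in losses:
        # Pick the expert with the smallest past loss (ties -> lowest index).
        leader = min(range(n), key=lambda i: cumulative[i])
        total += step[leader]
        cumulative = [c + s for c, s in zip(cumulative, step)]
    return total, min(cumulative)


T = 10
# Expert 0 suffers 0,1,0,1,...; expert 1 suffers 1/2,0,1,0,...
losses = [(float(t % 2), 0.5 if t == 0 else float((t + 1) % 2)) for t in range(T)]
fl_loss, best_loss = follow_the_leader(losses)
print(fl_loss, best_loss)
```

After the first step FL pays loss 1 in every round, so its total loss grows like $T$ while the best expert in hindsight only suffers about $T/2$.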

The FPL algorithm is defined as follows: 

Choose random vector $q\sim\exp$, i.e. each $q^i$ has density $P[q^i=u]=e^{-u}$ ($u\ge 0$) for all $1\le i\le n$.
For $t=1,\ldots,T$

  - Choose learning rate $\varepsilon_t$.

  - Output prediction of expert $i$ which minimizes $s_{<t}^i+(k^i-q^i)/\varepsilon_t$.

  - Receive loss $s_t^i$ for all experts $i$.
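
The loop above can be sketched in a few lines. This is an illustrative implementation only, assuming a finite expert class, losses given up front as a list of rows, and a single initial perturbation; the names `fpl_run` and `learning_rate` are hypothetical, and the rate $\sqrt{K/2t}$ used in the demo is the dynamic choice analyzed later in Theorem 6(ii).

```python
import math
import random


def fpl_run(losses, complexities, learning_rate, rng):
    """Follow the Perturbed Leader with one initial perturbation q.

    losses[t][i] in [0, 1] is the loss of expert i at step t,
    complexities[i] is k^i, and learning_rate(t) returns eps_t for t = 1, 2, ...
    Returns the chosen expert indices and the total loss suffered.
    """
    n = len(complexities)
    q = [rng.expovariate(1.0) for _ in range(n)]   # q^i ~ Exp(1), drawn once
    cumulative = [0.0] * n                         # s_{<t}
    choices, total = [], 0.0
    for t, step in enumerate(losses, start=1):
        eps = learning_rate(t)
        # Expert minimizing s^i_{<t} + (k^i - q^i) / eps_t
        i = min(range(n), key=lambda j: cumulative[j] + (complexities[j] - q[j]) / eps)
        choices.append(i)
        total += step[i]
        cumulative = [c + s for c, s in zip(cumulative, step)]
    return choices, total


rng = random.Random(0)
n, T = 4, 200
k = [math.log(n)] * n                                  # uniform complexities
losses = [[rng.random() for _ in range(n)] for _ in range(T)]
choices, total = fpl_run(losses, k, lambda t: math.sqrt(math.log(n) / (2 * t)), rng)
print(total, min(sum(col) for col in zip(*losses)))
```

Only the losses enter the computation; the experts' actual predictions and the observations never appear, in line with the abstraction used throughout.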

Other than $s_{<t}$, $k$ and $q$, FPL depends on the learning rate $\varepsilon_t$. We will give choices
for $\varepsilon_t$ in Section 5, after having established the main tools for the analysis. The
expected loss at time $t$ of FPL is $\ell_t := E[M(s_{<t}+\frac{k-q}{\varepsilon_t})\circ s_t]$. The key idea in the
FPL analysis is the use of an intermediate predictor IFPL (for Implicit or Infeasible
FPL). IFPL predicts according to $M(s_{1:t}+\frac{k-q}{\varepsilon_t})$, thus under the knowledge of $s_t$
(which is of course not available in reality). By $r_t := E[M(s_{1:t}+\frac{k-q}{\varepsilon_t})\circ s_t]$ we denote
the expected loss of IFPL at time $t$. The losses of IFPL will be upper bounded by
BEH in Section 3 and lower bounded by FPL in Section 4.

Notes. Observe that we have stated the FPL algorithm regardless of the actual
predictions of the experts and possible observations; only the losses are relevant.
Note also that an expert can implement a highly complicated strategy depending
on past outcomes, despite its trivializing identification with a constant unit vector.
The complex expert's (and environment's) behavior is summarized and hidden in the
state vector $s_t=\mathrm{Loss}(x_t,y_t^i)_{1\le i\le n}$. Our results therefore apply to arbitrary prediction
and observation spaces $\mathcal{Y}$ and $\mathcal{X}$ and arbitrary bounded loss functions. This is in
contrast to the major part of PEA work developed for binary alphabet and 0/1 or
absolute loss only. Finally note that the setup allows for losses generated by an
adversary who tries to maximize the regret of FPL and knows the FPL algorithm
and all experts' past predictions/losses. If the adversary also has access to FPL's
past decisions, then FPL must use independent randomization at each time step in
order to achieve good regret bounds.

$^2$ Apparently, there is no natural condition on $\mathcal{D}$ and/or $\mathcal{S}$ which guarantees the existence of all
minima for $n=\infty$.

3 IFPL bounded by Best Expert in Hindsight 

In this section we provide tools for comparing the loss of IFPL to the loss of the 
best expert in hindsight. The first result bounds the expected error induced by the 
exponentially distributed perturbation. 

Lemma 1 (Maximum of Shifted Exponential Distributions) Let $q^1,\ldots,q^n$ be
identically exponentially distributed random variables, i.e. $P[q^i]=e^{-q^i}$ for $q^i\ge 0$ and
$1\le i\le n\le\infty$, and $k^i\in\mathbb{R}$ be real numbers with $u:=\sum_{i=1}^n e^{-k^i}$. Then

$$E[\max_i\{q^i-k^i\}] \le 1+\ln u.$$

Proof. Using $P[q^i\ge b]\le e^{-b}$ for $b\in\mathbb{R}$ we get

$$P[\max_i\{q^i-k^i\}\ge a] = P[\exists i: q^i-k^i\ge a] \le \sum_{i=1}^n P[q^i-k^i\ge a] \le \sum_{i=1}^n e^{-a-k^i} = u\cdot e^{-a}$$

where the first inequality is the union bound. Using $E[z]\le E[\max\{0,z\}] =
\int_0^\infty P[\max\{0,z\}\ge y]\,dy = \int_0^\infty P[z\ge y]\,dy$ (valid for any real-valued random variable
$z$) for $z=\max_i\{q^i-k^i\}-\ln u$, this implies

$$E[\max_i\{q^i-k^i\}-\ln u] \le \int_0^\infty P[\max_i\{q^i-k^i\}\ge y+\ln u]\,dy \le \int_0^\infty e^{-y}\,dy = 1,$$

which proves the assertion. □

If $n$ is finite, a lower bound $E[\max_i q^i]\ge 0.57721+\ln n$ can be derived, showing
that the upper bound on $E[\max]$ is quite tight (at least) for $k^i=0\ \forall i$. The following
bound generalizes [KV03, Lem.3] to arbitrary weights.
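
Lemma 1 is easy to probe numerically. The following Monte Carlo sketch assumes independent perturbations and, purely for illustration, the complexities $k^i=2\ln(i+1)$ truncated to 50 experts; the helper name `expected_shifted_max` and the sample sizes are arbitrary choices.

```python
import math
import random


def expected_shifted_max(k, samples, rng):
    """Monte Carlo estimate of E[max_i (q^i - k^i)] with q^i ~ Exp(1) i.i.d."""
    total = 0.0
    for _ in range(samples):
        total += max(rng.expovariate(1.0) - ki for ki in k)
    return total / samples


rng = random.Random(1)
k = [2 * math.log(i + 1) for i in range(1, 51)]     # k^i = 2 ln(i+1), i = 1..50
u = sum(math.exp(-ki) for ki in k)
estimate = expected_shifted_max(k, 20000, rng)
bound = 1 + math.log(u)                             # right-hand side of Lemma 1
print(estimate, bound)
```

For uniform $k^i=0$ the same estimator should land between $0.57721+\ln n$ and $1+\ln n$, which illustrates how tight the lemma is in that case.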

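Theorem 2 (IFPL bounded by BEH) Let $\mathcal{D}\subseteq\mathbb{R}^n$, $s_t\in\mathbb{R}^n$ for $1\le t\le T$ (both
$\mathcal{D}$ and $s$ may even be negative, but we assume that all required extrema are attained),
and $q,k\in\mathbb{R}^n$. If $\varepsilon_t>0$ is decreasing in $t$, then the loss of the infeasible FPL knowing
$s_t$ at time $t$ in advance (l.h.s.) can be bounded in terms of the best predictor in
hindsight (first term on r.h.s.) plus additive corrections:

$$\sum_{t=1}^T M(s_{1:t}+\tfrac{k-q}{\varepsilon_t})\circ s_t \;\le\; \min_{d\in\mathcal{D}}\{d\circ(s_{1:T}+\tfrac{k}{\varepsilon_T})\} + \tfrac{1}{\varepsilon_T}\max_{d\in\mathcal{D}}\{d\circ(q-k)\} - \tfrac{1}{\varepsilon_T}M(s_{1:T}+\tfrac{k}{\varepsilon_T})\circ q.$$

Proof. For notational convenience, let $\varepsilon_0=\infty$ and $\tilde s_{1:t} = s_{1:t}+\frac{k-q}{\varepsilon_t}$. Consider the
losses $\tilde s_t = s_t+(k-q)(\frac{1}{\varepsilon_t}-\frac{1}{\varepsilon_{t-1}})$ for the moment. We first show by induction on $T$
that the infeasible predictor $M(\tilde s_{1:t})$ has zero regret, i.e.

$$\sum_{t=1}^T M(\tilde s_{1:t})\circ\tilde s_t \le M(\tilde s_{1:T})\circ\tilde s_{1:T}. \qquad (2)$$

For $T=1$ this is obvious. For the induction step from $T-1$ to $T$ we need to show

$$M(\tilde s_{1:T})\circ\tilde s_T \le M(\tilde s_{1:T})\circ\tilde s_{1:T} - M(\tilde s_{<T})\circ\tilde s_{<T}.$$

This follows from $\tilde s_{1:T}=\tilde s_{<T}+\tilde s_T$ and $M(\tilde s_{1:T})\circ\tilde s_{<T}\ge M(\tilde s_{<T})\circ\tilde s_{<T}$ by minimality of
$M$. Rearranging terms in (2), we obtain

$$\sum_{t=1}^T M(\tilde s_{1:t})\circ s_t \le M(\tilde s_{1:T})\circ\tilde s_{1:T} - \sum_{t=1}^T M(\tilde s_{1:t})\circ(k-q)\Big(\frac{1}{\varepsilon_t}-\frac{1}{\varepsilon_{t-1}}\Big) \qquad (3)$$

Moreover, by minimality of $M$,

$$M(\tilde s_{1:T})\circ\tilde s_{1:T} \le M(s_{1:T}+\tfrac{k}{\varepsilon_T})\circ(s_{1:T}+\tfrac{k-q}{\varepsilon_T}) = \min_{d\in\mathcal{D}}\{d\circ(s_{1:T}+\tfrac{k}{\varepsilon_T})\} - M(s_{1:T}+\tfrac{k}{\varepsilon_T})\circ\tfrac{q}{\varepsilon_T} \qquad (4)$$

holds. Using $\frac{1}{\varepsilon_t}-\frac{1}{\varepsilon_{t-1}}\ge 0$ and again minimality of $M$, we have

$$\sum_{t=1}^T\Big(\frac{1}{\varepsilon_t}-\frac{1}{\varepsilon_{t-1}}\Big)M(\tilde s_{1:t})\circ(q-k) \le \sum_{t=1}^T\Big(\frac{1}{\varepsilon_t}-\frac{1}{\varepsilon_{t-1}}\Big)M(k-q)\circ(q-k) = \frac{1}{\varepsilon_T}M(k-q)\circ(q-k) = \frac{1}{\varepsilon_T}\max_{d\in\mathcal{D}}\{d\circ(q-k)\} \qquad (5)$$

Inserting (4) and (5) back into (3) we obtain the assertion. □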

Assuming $q$ random with $E[q^i]=1$ and taking the expectation in Theorem 2,
the last term reduces to $-\frac{1}{\varepsilon_T}\sum_{i=1}^n M(s_{1:T}+\frac{k}{\varepsilon_T})^i$. If $\mathcal{D}\ge 0$, the term is negative and
may be dropped. In case of $\mathcal{D}=\mathcal{E}$ or $\Delta$, the last term is identical to $-\frac{1}{\varepsilon_T}$ (since
$\sum_i d^i=1$) and keeping it improves the bound. Furthermore, we need to evaluate the
expectation of the second to last term in Theorem 2, namely $E[\max_{d\in\mathcal{D}}\{d\circ(q-k)\}]$.
For $\mathcal{D}=\mathcal{E}$ and $q$ being exponentially distributed, using Lemma 1, the expectation is
bounded by $1+\ln u$. We hence get the following bound:

Corollary 3 (IFPL bounded by BEH) For $\mathcal{D}=\mathcal{E}$ and $\sum_i e^{-k^i}\le 1$ and $P[q^i]=e^{-q^i}$
for $q\ge 0$ and decreasing $\varepsilon_t>0$, the expected loss of the infeasible FPL exceeds
the loss of expert $i$ by at most $k^i/\varepsilon_T$:

$$r_{1:T} \le s^i_{1:T}+\frac{1}{\varepsilon_T}k^i \quad\text{for all } i\le n.$$

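Theorem 2 can be generalized to expert dependent factorizable $\varepsilon_t\leadsto\varepsilon_t^i=\varepsilon_t\cdot\varepsilon^i$ by
scaling $k^i\leadsto k^i/\varepsilon^i$ and $q^i\leadsto q^i/\varepsilon^i$. Using $E[\max_i\{\frac{q^i-k^i}{\varepsilon^i}\}]\le E[\max_i\{q^i-k^i\}]/\min_i\{\varepsilon^i\}$,
Corollary 3 generalizes to

$$E\Big[\sum_{t=1}^T M(s_{1:t}+\tfrac{k-q}{\varepsilon_t^i})\circ s_t\Big] \le s^i_{1:T}+\frac{1}{\varepsilon_T^i}k^i+\frac{1}{\varepsilon_T^{min}} \quad\forall i,$$

where $\varepsilon_T^{min}:=\min_i\{\varepsilon_T^i\}$. For example, for $\varepsilon_t^i=\sqrt{k^i/t}$, additionally assuming $k^i\ge 1$
$\forall i$, we get the desired bound $s^i_{1:T}+\sqrt{T}\cdot(\sqrt{k^i}+1)$. Unfortunately we were not able to
generalize Theorem 4 to expert-dependent $\varepsilon$, necessary for the final bound on FPL.
In Section 6 we solve this problem by a hierarchy of experts.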

4 Feasible FPL bounded by Infeasible FPL 

This section establishes the relation between the FPL and IFPL losses. Recall that
$\ell_t = E[M(s_{<t}+\frac{k-q}{\varepsilon_t})\circ s_t]$ is the expected loss of FPL at time $t$ and
$r_t = E[M(s_{1:t}+\frac{k-q}{\varepsilon_t})\circ s_t]$ is the expected loss of IFPL at time $t$.

Theorem 4 (FPL bounded by IFPL) For $\mathcal{D}=\mathcal{E}$ and $0\le s_t^i\le 1\ \forall i$ and arbitrary
$s_{<t}$ and $P[q^i]=e^{-q^i}$, the expected loss of the feasible FPL is at most a factor $e^{\varepsilon_t}>1$
larger than for the infeasible FPL:

$$\ell_t \le e^{\varepsilon_t}r_t, \quad\text{which implies}\quad \ell_{1:T}-r_{1:T} \le \sum_{t=1}^T \varepsilon_t\ell_t.$$

Furthermore, if $\varepsilon_t\le 1$, then also $\ell_t\le(1+\varepsilon_t+\varepsilon_t^2)r_t\le(1+2\varepsilon_t)r_t$.

Proof. Let $s=s_{<t}+\frac{1}{\varepsilon}k$ be the past cumulative penalized state vector, $q$ be a vector
of exponential distributions, i.e. $P[q^i]=e^{-q^i}$, and $\varepsilon=\varepsilon_t$. We now define the random
variables $I:=\arg\min_i\{s^i-\frac{1}{\varepsilon}q^i\}$ and $J:=\arg\min_i\{s^i+s_t^i-\frac{1}{\varepsilon}q^i\}$, where $0\le s_t^i\le 1\ \forall i$.
Furthermore, for fixed vector $x\in\mathbb{R}^n$ and fixed $j$ we define
$m:=\min_{i\ne j}\{s^i-\frac{1}{\varepsilon}x^i\}\le\min_{i\ne j}\{s^i+s_t^i-\frac{1}{\varepsilon}x^i\}=:m'$. With this notation and using the
independence of $q^j$ from $q^i$ for all $i\ne j$, we get

$$P[I=j \mid q^i=x^i\ \forall i\ne j] = P[s^j-\tfrac{1}{\varepsilon}q^j\le m \mid q^i=x^i\ \forall i\ne j] = P[q^j\ge\varepsilon(s^j-m)]$$
$$\le e^{\varepsilon}P[q^j\ge\varepsilon(s^j-m+1)] \le e^{\varepsilon}P[q^j\ge\varepsilon(s^j+s_t^j-m')]$$
$$= e^{\varepsilon}P[s^j+s_t^j-\tfrac{1}{\varepsilon}q^j\le m' \mid q^i=x^i\ \forall i\ne j] = e^{\varepsilon}P[J=j \mid q^i=x^i\ \forall i\ne j],$$

where we have used $P[q^j\ge a]\le e^{\varepsilon}P[q^j\ge a+\varepsilon]$. Since this bound holds under any
condition $x$, it also holds unconditionally, i.e. $P[I=j]\le e^{\varepsilon}P[J=j]$. For $\mathcal{D}=\mathcal{E}$ we
have $s_t^I=M(s_{<t}+\frac{k-q}{\varepsilon})\circ s_t$ and $s_t^J=M(s_{1:t}+\frac{k-q}{\varepsilon})\circ s_t$, which implies

$$\ell_t = E[s_t^I] = \sum_{j=1}^n s_t^j\cdot P[I=j] \le e^{\varepsilon}\sum_{j=1}^n s_t^j\cdot P[J=j] = e^{\varepsilon}E[s_t^J] = e^{\varepsilon}r_t.$$

Finally, $\ell_t-r_t\le\varepsilon_t\ell_t$ follows from $r_t\ge e^{-\varepsilon_t}\ell_t\ge(1-\varepsilon_t)\ell_t$, and
$\ell_t\le e^{\varepsilon_t}r_t\le(1+\varepsilon_t+\varepsilon_t^2)r_t\le(1+2\varepsilon_t)r_t$ for $\varepsilon_t\le 1$ is elementary. □
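
For two experts the inequality $\ell_t\le e^{\varepsilon_t}r_t$ can be checked exactly, because the difference of two independent unit exponentials follows a standard Laplace distribution. The sketch below relies on that closed form; the function names and the test grid are illustrative assumptions, and `s` plays the role of the penalized past loss $s_{<t}+k/\varepsilon_t$.

```python
import math


def laplace_tail(a):
    """P[q1 - q2 >= a] for independent q1, q2 ~ Exp(1) (standard Laplace tail)."""
    return 0.5 * math.exp(-a) if a >= 0 else 1.0 - 0.5 * math.exp(a)


def expected_losses(s, s_t, eps):
    """Exact (ell_t, r_t) for two experts with penalized past loss s and current loss s_t."""
    def expected(v):
        # P[argmin_i (v_i - q_i/eps) = 0] = P[q_0 - q_1 >= eps * (v_0 - v_1)]
        p0 = laplace_tail(eps * (v[0] - v[1]))
        return p0 * s_t[0] + (1.0 - p0) * s_t[1]
    fpl = expected(s)                                   # leader chosen from past losses only
    ifpl = expected((s[0] + s_t[0], s[1] + s_t[1]))     # leader chosen knowing s_t
    return fpl, ifpl


grid = [x / 4 for x in range(0, 5)]
worst = 0.0
for eps in (0.1, 0.5, 1.0):
    for gap in (-3.0, -0.5, 0.0, 0.5, 3.0):
        for a in grid:
            for b in grid:
                ell, r = expected_losses((gap, 0.0), (a, b), eps)
                if r > 0:
                    worst = max(worst, ell / (math.exp(eps) * r))
print(worst)   # largest observed ratio ell_t / (e^eps * r_t)
```

The ratio reaches 1 (up to rounding) when the trailing expert alone suffers the maximal loss 1, so on this grid the factor $e^{\varepsilon_t}$ cannot be improved.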

Remark. As in [KV03], one can prove a similar statement for general decision
space $\mathcal{D}$ as long as $\sum_i|s_t^i|\le A$ is guaranteed for some $A>0$: In this case, we have
$\ell_t\le e^{\varepsilon_t A}r_t$. If $n$ is finite, then the bound holds for $A=n$. For $n=\infty$, the assertion
holds under the somewhat unnatural assumption that $\mathcal{S}$ is $l^1$-bounded.

5 Combination of Bounds and Choices for e t 

Throughout this section, we assume

$$\mathcal{D}=\mathcal{E},\quad s_t\in[0,1]^n\ \forall t,\quad P[q^i]=e^{-q^i}\ \forall i,\quad\text{and}\quad \sum_i e^{-k^i}\le 1. \qquad (6)$$

We distinguish static and dynamic bounds. Static bounds refer to a constant $\varepsilon_t\equiv\varepsilon$.
Since this value has to be chosen in advance, a static choice of $\varepsilon_t$ requires certain
prior information and therefore is not practical in many cases. However, the static
bounds are very easy to derive, and they provide a good means to compare different
PEA algorithms. If on the other hand the algorithm shall be applied without
appropriate prior knowledge, a dynamic choice of $\varepsilon_t$ depending only on $t$ and/or past
observations, is necessary.

Theorem 5 (FPL bound for static $\varepsilon_t=\varepsilon\propto 1/\sqrt{L}$) Assume (6) holds, then the
expected loss $\ell_t$ of feasible FPL, which employs the prediction of the expert $i$
minimizing $s^i_{<t}+\frac{k^i-q^i}{\varepsilon_t}$, is bounded by the loss of the best expert in hindsight in the following
way:

i) For $\varepsilon_t=\varepsilon=1/\sqrt{L}$ with $L\ge\ell_{1:T}$ we have

$$\ell_{1:T}\le s^i_{1:T}+\sqrt{L}(k^i+1)\quad\forall i$$

ii) For $\varepsilon_t=\sqrt{K/L}$ with $L\ge\ell_{1:T}$ and $k^i\le K\ \forall i$ we have

$$\ell_{1:T}\le s^i_{1:T}+2\sqrt{LK}\quad\forall i$$

iii) For $\varepsilon_t=\sqrt{k^i/L}$ with $L\ge\max\{s^i_{1:T},k^i\}$ we have

$$\ell_{1:T}\le s^i_{1:T}+2\sqrt{Lk^i}+3k^i$$

Note that according to assertion (iii), knowledge of only the ratio of the
complexity and the loss of the best expert is sufficient in order to obtain good static
bounds, even for non-uniform complexities.

Proof. (i,ii) For $\varepsilon_t=\sqrt{K/L}$ and $L\ge\ell_{1:T}$, from Theorem 4 and Corollary 3, we get

$$\ell_{1:T}-r_{1:T}\le\sum_{t=1}^T\varepsilon_t\ell_t=\ell_{1:T}\sqrt{K/L}\le\sqrt{LK}\quad\text{and}\quad r_{1:T}-s^i_{1:T}\le k^i/\varepsilon_T=k^i\sqrt{L/K}$$

Combining both, we get $\ell_{1:T}-s^i_{1:T}\le\sqrt{L}(\sqrt{K}+k^i/\sqrt{K})$. (i) follows from $K=1$ and
(ii) from $k^i\le K$.

(iii) For $\varepsilon=\sqrt{k^i/L}\le 1$ we get

$$\ell_{1:T}\le e^{\varepsilon}r_{1:T}\le(1+\varepsilon+\varepsilon^2)r_{1:T}\le\Big(1+\sqrt{\tfrac{k^i}{L}}+\tfrac{k^i}{L}\Big)\Big(s^i_{1:T}+\sqrt{\tfrac{L}{k^i}}\,k^i\Big)$$
$$\le s^i_{1:T}+\sqrt{Lk^i}+\Big(\sqrt{\tfrac{k^i}{L}}+\tfrac{k^i}{L}\Big)(L+\sqrt{Lk^i}) = s^i_{1:T}+2\sqrt{Lk^i}+\Big(2+\sqrt{\tfrac{k^i}{L}}\Big)k^i$$

□



The static bounds require knowledge of an upper bound $L$ on the loss (or the
ratio of the complexity of the best expert and its loss). Since the instantaneous
loss is bounded by 1, one may set $L=T$ if $T$ is known in advance. For finite $n$
and $k^i=K=\ln n$, bound (ii) gives the classic regret $\propto\sqrt{T\ln n}$. If neither $T$ nor $L$
is known, a dynamic choice of $\varepsilon_t$ is necessary. We first present bounds with regret
$\propto\sqrt{T}$, thereafter with regret $\propto\sqrt{s^i_{1:T}}$.

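Theorem 6 (FPL bound for dynamic $\varepsilon_t\propto 1/\sqrt{t}$) Assume (6) holds.

i) For $\varepsilon_t=1/\sqrt{t}$ we have $\ell_{1:T}\le s^i_{1:T}+\sqrt{T}(k^i+2)\ \forall i$
ii) For $\varepsilon_t=\sqrt{K/2t}$ and $k^i\le K\ \forall i$ we have $\ell_{1:T}\le s^i_{1:T}+2\sqrt{2TK}\ \forall i$

Proof. For $\varepsilon_t=\sqrt{K/2t}$, using $\sum_{t=1}^T\frac{1}{\sqrt{t}}\le\int_0^T\frac{dt}{\sqrt{t}}=2\sqrt{T}$ and $\ell_t\le 1$ we get

$$\ell_{1:T}-r_{1:T}\le\sum_{t=1}^T\varepsilon_t\le\sqrt{2TK}\quad\text{and}\quad r_{1:T}-s^i_{1:T}\le k^i/\varepsilon_T=k^i\sqrt{2T/K}$$

Combining both, we get $\ell_{1:T}-s^i_{1:T}\le\sqrt{2T}(\sqrt{K}+k^i/\sqrt{K})$. (i) follows from $K=2$
and (ii) from $k^i\le K$. □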

In Theorem 5 we assumed knowledge of an upper bound $L$ on $\ell_{1:T}$. In an adaptive
form, $L_t:=\ell_{<t}+1$, known at the beginning of time $t$, could be used as an upper bound
on $\ell_{1:t}$ with corresponding adaptive $\varepsilon_t\propto 1/\sqrt{L_t}$. Such choice of $\varepsilon_t$ is also called
self-confident [ACBG02].
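
The three learning-rate choices discussed so far (static, time-dependent, self-confident) differ only in what replaces the unknown loss bound $L$. The following sketch collects them as small helper functions; the function names are hypothetical, and the printed values merely illustrate the bounds of Theorems 5(ii) and 6(ii) for an assumed setting of $n=16$ experts and $T=10000$ steps.

```python
import math


def eps_static(K, L):
    """Static rate sqrt(K/L) of Theorem 5(ii); L must upper-bound the total loss."""
    return math.sqrt(K / L)


def eps_dynamic(K, t):
    """Time-dependent rate sqrt(K/(2t)) of Theorem 6(ii)."""
    return math.sqrt(K / (2 * t))


def eps_self_confident(K, past_expected_loss):
    """Self-confident rate sqrt(K / (2 (ell_{<t} + 1))), using L_t = ell_{<t} + 1."""
    return math.sqrt(K / (2 * (past_expected_loss + 1)))


n, T = 16, 10_000
K = math.log(n)
print("static regret bound :", 2 * math.sqrt(T * K))        # Theorem 5(ii) with L = T
print("dynamic regret bound:", 2 * math.sqrt(2 * T * K))    # Theorem 6(ii)
# The step used in the proof of Theorem 6: sum_{t<=T} 1/sqrt(t) <= 2 sqrt(T)
print(sum(1 / math.sqrt(t) for t in range(1, T + 1)), 2 * math.sqrt(T))
```

Not knowing $T$ in advance costs a factor of $\sqrt{2}$ in the regret term here, which is visible in the two printed bounds.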
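Theorem 7 (FPL bound for self-confident $\varepsilon_t\propto 1/\sqrt{\ell_{<t}}$) Assume (6) holds.

i) For $\varepsilon_t=1/\sqrt{2(\ell_{<t}+1)}$ we have

$$\ell_{1:T}\le s^i_{1:T}+(k^i+1)\sqrt{2(s^i_{1:T}+1)}+2(k^i+1)^2\quad\forall i$$

ii) For $\varepsilon_t=\sqrt{K/2(\ell_{<t}+1)}$ and $k^i\le K\ \forall i$ we have

$$\ell_{1:T}\le s^i_{1:T}+2\sqrt{2(s^i_{1:T}+1)K}+8K\quad\forall i$$

Proof. Using $\varepsilon_t=\sqrt{K/2(\ell_{<t}+1)}\le\sqrt{K/2\ell_{1:t}}$ and $\frac{b-a}{\sqrt{b}}=(\sqrt{b}-\sqrt{a})(\sqrt{b}+\sqrt{a})\frac{1}{\sqrt{b}}\le
2(\sqrt{b}-\sqrt{a})$ for $a\le b$ and $t_0=\min\{t:\ell_{1:t}>0\}$ we get

$$\ell_{1:T}-r_{1:T}\le\sum_{t=t_0}^T\varepsilon_t\ell_t\le\sqrt{\frac{K}{2}}\sum_{t=t_0}^T\frac{\ell_{1:t}-\ell_{<t}}{\sqrt{\ell_{1:t}}}\le\sqrt{2K}\sum_{t=t_0}^T\big[\sqrt{\ell_{1:t}}-\sqrt{\ell_{<t}}\big]=\sqrt{2K}\sqrt{\ell_{1:T}}$$

Adding $r_{1:T}-s^i_{1:T}\le\frac{k^i}{\varepsilon_T}\le k^i\sqrt{2(\ell_{1:T}+1)/K}$ we get

$$\ell_{1:T}-s^i_{1:T}\le\sqrt{2\bar\kappa^i(\ell_{1:T}+1)},\quad\text{where}\quad\sqrt{\bar\kappa^i}:=\sqrt{K}+k^i/\sqrt{K}.$$

Taking the square and solving the resulting quadratic inequality w.r.t. $\ell_{1:T}$ we get

$$\ell_{1:T}\le s^i_{1:T}+\bar\kappa^i+\sqrt{2(s^i_{1:T}+1)\bar\kappa^i+(\bar\kappa^i)^2}\le s^i_{1:T}+\sqrt{2(s^i_{1:T}+1)\bar\kappa^i}+2\bar\kappa^i$$

For $K=1$ we get $\sqrt{\bar\kappa^i}=k^i+1$ which yields (i). For $k^i\le K$ we get $\bar\kappa^i\le 4K$ which
yields (ii). □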

The proofs of results similar to (ii) for WM for 0/1 loss all fill several pages
[ACBG02, YEYS04]. The next result establishes a similar bound, but instead of
using the expected value $\ell_{<t}$, the best loss so far $s^{min}_{<t}$ is used. This may have
computational advantages, since $s^{min}_{<t}$ is immediately available, while $\ell_{<t}$ needs to
be evaluated (see discussion in Section 7).



Theorem 8 (FPL bound for adaptive $\varepsilon_t\propto 1/\sqrt{s^{min}_{<t}}$) Assume (6) holds.

i) For $\varepsilon_t=1/\min_i\{k^i+\sqrt{(k^i)^2+2s^i_{<t}+2}\}$ we have

$$\ell_{1:T}\le s^i_{1:T}+(k^i+2)\sqrt{2s^i_{1:T}}+2(k^i+2)^2\quad\forall i$$

ii) For $\varepsilon_t=\sqrt{\tfrac12}\cdot\min\{1,\sqrt{K/s^{min}_{<t}}\}$ and $k^i\le K\ \forall i$ we have

$$\ell_{1:T}\le s^{min}_{1:T}+2\sqrt{2Ks^{min}_{1:T}}+5K\ln(s^{min}_{1:T})+3K+6.$$

We briefly motivate the strange-looking choice for $\varepsilon_t$ in (i). The first naive
candidate, $\varepsilon_t\propto 1/\sqrt{s^{min}_{<t}}$, turns out too large. The next natural trial is requesting
$\varepsilon_t=1/\sqrt{2\min_i\{s^i_{<t}+\frac{k^i}{\varepsilon_t}\}}$. Solving this equation results in $\varepsilon_t=1/(k^i+\sqrt{(k^i)^2+2s^i_{<t}})$,
where $i$ is the index for which $s^i_{<t}+\frac{k^i}{\varepsilon_t}$ is minimal.

Proof. Similar to the proof of the previous theorem, but more technical. □

The bound (i) is a complete square, and also the bounds of Theorem 7 when
adding 1 to them. Hence the bounds can be written as $\sqrt{\ell_{1:T}}\le\sqrt{s^i_{1:T}}+\sqrt{2}(k^i+2)$
and $\sqrt{\ell_{1:T}}\le\sqrt{s^i_{1:T}+1}+\sqrt{8K}$ and $\sqrt{\ell_{1:T}}\le\sqrt{s^i_{1:T}+1}+\sqrt{2}(k^i+1)$, respectively, hence
the $\sqrt{\mathrm{Loss}}$-regrets are bounded for $T\to\infty$.

Remark. The same analysis as for Theorems [5-8](ii) applies to general $\mathcal{D}$, using
$\ell_t\le e^{\varepsilon_t n}r_t$ instead of $\ell_t\le e^{\varepsilon_t}r_t$, and leading to an additional factor $\sqrt{n}$ in the regret.
Compare the remark at the end of Section 4.



6 Hierarchy of Experts 

We derived bounds which do not need prior knowledge of $L$ with regret $\propto\sqrt{TK}$
and $\propto\sqrt{s^i_{1:T}K}$ for a finite number of experts with equal penalty $K=k^i=\ln n$. For
an infinite number of experts, unbounded expert-dependent complexity penalties $k^i$
are necessary (due to constraint $\sum_i e^{-k^i}\le 1$). Bounds for this case (without prior
knowledge of $T$) with regret $\propto k^i\sqrt{T}$ and $\propto k^i\sqrt{s^i_{1:T}}$ have been derived. In this case,
the complexity $k^i$ is no longer under the square root. It is likely that improved
regret bounds $\propto\sqrt{Tk^i}$ and $\propto\sqrt{s^i_{1:T}k^i}$ as in the finite case hold. We were not able to
derive such improved bounds for FPL, but for a (slight) modification. We consider
a two-level hierarchy of experts. First consider an FPL for the subclass of experts
of complexity $K$, for each $K\in\mathbb{N}$. Regard these FPL$^K$ as (meta) experts and
use them to form a (meta) $\widetilde{\mathrm{FPL}}$. The class of meta experts now contains for each
complexity only one (meta) expert, which allows us to derive good bounds. In the
following, quantities referring to complexity class $K$ are superscripted by $K$, and
meta quantities are superscripted by ~.

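Consider the class of experts $\mathcal{E}^K:=\{i: K-1<k^i\le K\}$ of complexity $K$, for each
$K\in\mathbb{N}$. FPL$^K$ makes randomized prediction $I_t^K:=\arg\min_{i\in\mathcal{E}^K}\{s^i_{<t}+\frac{k^i-q^i}{\varepsilon_t^K}\}$ with
$\varepsilon_t^K:=\sqrt{K/2t}$ and suffers loss $u_t^K:=s_t^{I_t^K}$ at time $t$. Since $k^i\le K\ \forall i\in\mathcal{E}^K$ we can apply
Theorem 6(ii) to FPL$^K$:

$$E[u^K_{1:T}]=\ell^K_{1:T}\le s^i_{1:T}+2\sqrt{2TK}\quad\forall i\in\mathcal{E}^K\ \ \forall K\in\mathbb{N} \qquad (7)$$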

We now define a meta state $\tilde s_t^K=u_t^K$ and regard FPL$^K$ for $K\in\mathbb{N}$ as meta experts,
so meta expert $K$ suffers loss $\tilde s_t^K$. (Assigning expected loss $\tilde s_t^K=E[u_t^K]=\ell_t^K$ to FPL$^K$
would also work.) Hence the setting is again an expert setting and we define the
meta $\widetilde{\mathrm{FPL}}$ to predict $\tilde I_t:=\arg\min_{K\in\mathbb{N}}\{\tilde s^K_{<t}+\frac{\tilde k^K-\tilde q^K}{\tilde\varepsilon_t}\}$ with $\tilde\varepsilon_t=1/\sqrt{t}$ and $\tilde k^K=\frac12+2\ln K$
(implying $\sum_{K=1}^\infty e^{-\tilde k^K}\le 1$). Note that $\tilde s^K_{1:t}=\tilde s^K_1+\ldots+\tilde s^K_t=s_1^{I_1^K}+\ldots+s_t^{I_t^K}$ sums over the
same meta state components $K$, but over different components $I_t^K$ in normal state
representation.

By Theorem 6(i) the $\tilde q$-expected loss of $\widetilde{\mathrm{FPL}}$ is bounded by $\tilde s^K_{1:T}+\sqrt{T}(\tilde k^K+2)$.
As this bound holds for all $q$ it also holds in $q$-expectation. So if we define $\tilde\ell_{1:T}$ to be
the $q$ and $\tilde q$ expected loss of $\widetilde{\mathrm{FPL}}$, and chain this bound with (7) for $i\in\mathcal{E}^K$ we get:

$$\tilde\ell_{1:T}\le E[\tilde s^K_{1:T}+\sqrt{T}(\tilde k^K+2)]=\ell^K_{1:T}+\sqrt{T}(\tilde k^K+2)$$
$$\le s^i_{1:T}+\sqrt{T}\big[2\sqrt{2(k^i+1)}+\tfrac12+2\ln(k^i+1)+2\big],$$

where we have used $K\le k^i+1$. This bound is valid for all $i$ and has the desired
regret $\propto\sqrt{Tk^i}$. Similarly we can derive regret bounds $\propto\sqrt{s^i_{1:T}k^i}$ by exploiting that
the bounds are concave in $s^i_{1:T}$ and using Jensen's inequality.
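
The bookkeeping behind the two-level construction is simple to illustrate. The sketch below groups experts into complexity classes and checks that the meta complexities $\frac12+2\ln K$ respect the weight constraint; the helper names are hypothetical, the complexities $k^i=2\ln(i+1)$ are just the running example from Section 2, and assigning $k^i=0$ to class 1 is an assumption made here to keep every expert in some class.

```python
import math
from collections import defaultdict


def complexity_classes(k):
    """Group expert indices by class K with K - 1 < k^i <= K (K = 1, 2, ...)."""
    classes = defaultdict(list)
    for i, ki in enumerate(k):
        # max(1, ...) puts k^i = 0 into class 1 (an assumption of this sketch).
        classes[max(1, math.ceil(ki))].append(i)
    return dict(classes)


def meta_complexity(K):
    """Meta complexity 1/2 + 2 ln K assigned to the class-K sub-algorithm."""
    return 0.5 + 2 * math.log(K)


k = [2 * math.log(i + 1) for i in range(1, 21)]      # k^i = 2 ln(i+1), i = 1..20
classes = complexity_classes(k)
print({K: len(members) for K, members in sorted(classes.items())})
# Partial sums of e^{-meta_complexity(K)} approach e^{-1/2} * pi^2 / 6 < 1
meta_weight_sum = sum(math.exp(-meta_complexity(K)) for K in range(1, 100_000))
print(meta_weight_sum)
```

Since $e^{-1/2}\pi^2/6\approx 0.998$, the meta weights satisfy the constraint with almost no slack, so the constant $\frac12$ is close to the smallest offset that works with the $2\ln K$ term.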

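Theorem 9 (Hierarchical FPL bound for dynamic $\varepsilon_t$) The hierarchical $\widetilde{\mathrm{FPL}}$
employs at time $t$ the prediction of expert $i_t:=I_t^{\tilde I_t}$, where

$$I_t^K:=\arg\min_{i\in\mathcal{E}^K}\Big\{s^i_{<t}+\frac{k^i-q^i}{\varepsilon_t^K}\Big\}\quad\text{and}\quad\tilde I_t:=\arg\min_{K\in\mathbb{N}}\Big\{s_1^{I_1^K}+\ldots+s_{t-1}^{I_{t-1}^K}+\frac{\frac12+2\ln K-\tilde q^K}{\tilde\varepsilon_t}\Big\}$$

Under assumptions (6) and $P[\tilde q^K]=e^{-\tilde q^K}\ \forall K\in\mathbb{N}$, the expected loss $\tilde\ell_{1:T}=E[s_1^{i_1}+
\ldots+s_T^{i_T}]$ of $\widetilde{\mathrm{FPL}}$ is bounded as follows:

a) For $\varepsilon_t^K=\sqrt{K/2t}$ and $\tilde\varepsilon_t=1/\sqrt{t}$ we have

$$\tilde\ell_{1:T}\le s^i_{1:T}+2\sqrt{2Tk^i}\cdot\Big(1+O\big(\tfrac{\ln k^i}{\sqrt{k^i}}\big)\Big)\quad\forall i.$$

b) For $\tilde\varepsilon_t$ as in (i) and $\varepsilon_t^K$ as in (ii) of Theorem {7, 8} we have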

The hierarchical $\widetilde{\mathrm{FPL}}$ differs from a direct FPL over all experts $\mathcal{E}$. One potential
way to prove a bound on direct FPL may be to show (if it holds) that FPL performs
better than $\widetilde{\mathrm{FPL}}$, i.e. $\ell_{1:T}\le\tilde\ell_{1:T}$. Another way may be to suitably generalize
Theorem 4 to expert dependent $\varepsilon$.



7 Miscellaneous 



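Lower Bound on FPL. For finite $n$, a lower bound on FPL similar to the upper
bound in Theorem 2 can also be proven. For any $\mathcal{D}\subseteq\mathbb{R}^n$ and $s_t\in\mathbb{R}^n$ such that
the required extrema exist, $q\in\mathbb{R}^n$, and $\varepsilon_t>0$ decreasing, the loss of FPL for
uniform complexities can be lower bounded in terms of the best predictor in hindsight
plus/minus additive corrections:

$$\sum_{t=1}^T M(s_{<t}-\tfrac{q}{\varepsilon_t})\circ s_t \ge \min_{d\in\mathcal{D}}\{d\circ s_{1:T}\}-\frac{1}{\varepsilon_T}\max_{d\in\mathcal{D}}\{d\circ q\}+\sum_{t=1}^T\Big(\frac{1}{\varepsilon_t}-\frac{1}{\varepsilon_{t-1}}\Big)M(s_{<t})\circ q \qquad (8)$$

For $\mathcal{D}=\mathcal{E}$ and any $\mathcal{S}$ and all $k^i$ equal and $P[q^i]=e^{-q^i}$ for $q\ge 0$ and decreasing $\varepsilon_t>0$,
this reduces to

$$\ell_{1:T}\ge s^{min}_{1:T}-\frac{\ln n}{\varepsilon_T} \qquad (9)$$

The upper and lower bounds on $\ell_{1:T}$ (Theorem 4 and Corollary 3 and (9)) together
show that

$$\frac{\ell_{1:t}}{s^{min}_{1:t}}\to 1\quad\text{if}\quad\varepsilon_t\to 0\ \text{ and }\ \varepsilon_t\cdot s^{min}_{1:t}\to\infty\ \text{ and }\ k^i=K\ \forall i. \qquad (10)$$

For instance, $\varepsilon_t=\sqrt{K/2s^{min}_{<t}}$. For $\varepsilon_t=\sqrt{K/2(\ell_{<t}+1)}$ we proved the bound in
Theorem 7(ii). Knowing that $\sqrt{K/2(\ell_{<t}+1)}$ converges to $\sqrt{K/2s^{min}_{<t}}$ due to (10), we can
derive a bound similar to Theorem 7(ii) for $\varepsilon_t=\sqrt{K/2s^{min}_{<t}}$. This choice for $\varepsilon_t$ has
the advantage that we do not have to compute $\ell_{<t}$ (see below), as also achieved by
Theorem 8(ii). We do not know whether (8) can be generalized to expert dependent
complexities $k^i$.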

Initial versus independent randomization. So far we assumed that the
perturbations are sampled only once at time $t=0$. As already indicated, under the
expectation this is equivalent to generating a new perturbation $q_t$ at each time step
$t$. While the former way is favorable for the analysis, the latter may have two
advantages. First, if the losses are generated by an adaptive adversary, then he may after
some time figure out the random perturbation and use it to force FPL to have a
large loss. Second, repeated sampling of the perturbations guarantees better bounds
with high probability.

Bounds with high probability. We have derived several bounds for the expected
loss $\ell_{1:T}$ of FPL. The actual loss at time $t$ is $u_t=M(s_{<t}+\frac{k-q}{\varepsilon_t})\circ s_t$. A simple Markov
inequality shows that the total actual loss $u_{1:T}$ exceeds the total expected loss $\ell_{1:T}=
E[u_{1:T}]$ by a factor of $c>1$ with probability at most $1/c$:

$$P[u_{1:T}\ge c\cdot\ell_{1:T}]\le 1/c$$

Randomizing independently for each $t$ as described in the previous paragraph, the
actual loss is $u_t=M(s_{<t}+\frac{k-q_t}{\varepsilon_t})\circ s_t$ with the same expected loss $\ell_{1:T}=E[u_{1:T}]$ as
before. The advantage of independent randomization is that we can get a much
better high-probability bound. We can exploit a Chernoff-Hoeffding bound [McD89,
Cor.5.2b], valid for arbitrary independent random variables $0\le u_t\le 1$ for $t=1,\ldots,T$:

$$P\big[|u_{1:T}-E[u_{1:T}]|\ge\delta E[u_{1:T}]\big]\le 2\exp(-\tfrac13\delta^2E[u_{1:T}]),\quad 0\le\delta\le 1.$$

For $\delta=\sqrt{3c/\ell_{1:T}}$ we get

$$P\big[|u_{1:T}-\ell_{1:T}|\ge\sqrt{3c\,\ell_{1:T}}\big]\le 2e^{-c}\quad\text{as soon as}\quad\ell_{1:T}\ge 3c. \qquad (11)$$

Using (11), the bounds for $\ell_{1:T}$ of Theorems 5-8 can be rewritten to yield similar
bounds with high probability ($1-2e^{-c}$) for $u_{1:T}$ with small extra regret $\propto\sqrt{c\cdot L}$ or
$\propto\sqrt{c\cdot s^i_{1:T}}$. Furthermore, (11) shows that with high probability, $u_{1:T}/\ell_{1:T}$ converges
rapidly to 1 for $\ell_{1:T}\to\infty$. Hence we may use the easier to compute $\varepsilon_t=\sqrt{K/2u_{<t}}$
instead of $\varepsilon_t=\sqrt{K/2(\ell_{<t}+1)}$, with similar bounds on the regret.
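
To get a feeling for the gap between the Markov and the Chernoff-Hoeffding guarantees, one can evaluate both for a fixed failure probability. The sketch below assumes the form of (11) with deviation $\sqrt{3c\,\ell_{1:T}}$ and failure probability $2e^{-c}$; the function names and the example value $\ell_{1:T}=1000$ are illustrative.

```python
import math


def markov_tail(c):
    """Markov bound on P[u_{1:T} >= c * ell_{1:T}] for c > 1."""
    return 1.0 / c


def chernoff_radius(c, expected_loss):
    """Deviation sqrt(3 c ell) exceeded with probability at most 2 e^{-c} by (11)."""
    if expected_loss < 3 * c:
        raise ValueError("(11) requires expected_loss >= 3c")
    return math.sqrt(3 * c * expected_loss)


ell = 1000.0
c = math.log(2 / 0.01)                 # failure probability 2 e^{-c} = 1%
radius = chernoff_radius(c, ell)
print(radius, radius / ell)            # absolute and relative deviation
# Markov, for comparison: the multiplicative factor needed for the same 1% level
print(1 / 0.01)
```

With independent randomization the actual loss stays within roughly 13% of its expectation at the 1% level in this example, whereas Markov's inequality only excludes a factor of 100.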

Computational Aspects. It is easy to generate the randomized decision of FPL.
Indeed, only a single initial exponentially distributed vector $q\in\mathbb{R}^n$ is needed. Only
for adaptive $\varepsilon_t\propto 1/\sqrt{\ell_{<t}}$ (see Theorem 7) we need to compute expectations explicitly.
Given $\varepsilon_t$, from $t\leadsto t+1$ we need to compute $\ell_t$ in order to update $\varepsilon_t$. Note that $\ell_t=
w_t\circ s_t$, where $w_t^i=P[I_t=i]$ and $I_t:=\arg\min_{i\in\mathcal{E}}\{s^i_{<t}+\frac{k^i-q^i}{\varepsilon_t}\}$ is the actual (randomized)
prediction of FPL. With $s:=s_{<t}+k/\varepsilon_t$ and $s^{min}:=\min_j s^j$, $P[I_t=i]$ has the following representation:

$$P[I_t=i]=P[s^i-\tfrac{q^i}{\varepsilon_t}\le s^j-\tfrac{q^j}{\varepsilon_t}\ \forall j\ne i]=\int_{-\infty}^{s^{min}}\varepsilon_te^{-\varepsilon_t(s^i-m)}\prod_{j\ne i}\big(1-e^{-\varepsilon_t(s^j-m)}\big)\,dm$$
$$=\sum_{\mathcal{M}:\{i\}\subseteq\mathcal{M}\subseteq\{1,\ldots,n\}}\frac{(-1)^{|\mathcal{M}|-1}}{|\mathcal{M}|}\,e^{-\varepsilon_t\sum_{j\in\mathcal{M}}(s^j-s^{min})}$$

In the last equality we expanded the product and performed the resulting
exponential integrals. For finite $n$, the one-dimensional integral should be numerically
feasible. Once the product $\prod_{j=1}^n(1-e^{-\varepsilon_t(s^j-m)})$ has been computed in time $O(n)$,
the argument of the integral can be computed for each $i$ in time $O(1)$, hence the
overall time to compute $\ell_t$ is $O(c\cdot n)$, where $c$ is the time to numerically compute
one integral. For infinite$^3$ $n$, the last sum may be approximated by the dominant
contributions. The expectation may also be approximated by (Monte Carlo)
sampling $I_t$ several times. Recall that approximating $\ell_{<t}$ can be avoided by using $s^{min}_{<t}$
(Theorem 8) or $u_{<t}$ (bounds with high probability) instead.
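
For very small $n$ the selection probabilities can be evaluated exactly. The sketch below assumes the expanded subset-sum form $P[I=i]=\sum_{\mathcal{M}\ni i}\frac{(-1)^{|\mathcal{M}|-1}}{|\mathcal{M}|}e^{-\varepsilon\sum_{j\in\mathcal{M}}(s^j-s^{min})}$, which follows from integrating over the value $m$ of the minimum and expanding the product; the function name is hypothetical and the cost is exponential in $n$, so this is a check rather than a practical method.

```python
import math
from itertools import combinations


def choice_probabilities(s, eps):
    """P[I = i] for I = argmin_i (s^i - q^i/eps), q^i ~ Exp(1) i.i.d.

    Uses the expanded subset sum, so the cost is exponential in len(s).
    """
    n = len(s)
    s_min = min(s)
    probs = []
    for i in range(n):
        rest = [j for j in range(n) if j != i]
        p = 0.0
        for size in range(len(rest) + 1):
            for subset in combinations(rest, size):
                members = (i,) + subset                 # the set M containing i
                exponent = eps * sum(s[j] - s_min for j in members)
                p += (-1) ** (len(members) - 1) / len(members) * math.exp(-exponent)
        probs.append(p)
    return probs


s_t = [0.2, 1.0, 0.0]                       # current losses s_t
w = choice_probabilities([0.0, 0.7, 1.5], eps=0.8)
print(w, sum(w))
print(sum(wi * si for wi, si in zip(w, s_t)))   # expected loss w_t . s_t
```

For two experts the result reduces to the Laplace tail $1-\frac12e^{-\varepsilon a}$ for the leading expert, where $a$ is the gap in penalized loss.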

Deterministic prediction and absolute loss. Another use of $w_t$ from the last
paragraph is the following: If the decision space is $\mathcal{D}=\Delta$, then FPL may make a
deterministic decision $d=w_t\in\Delta$ at time $t$ with bounds now holding for sure, instead
of selecting $e_i$ with probability $w_t^i$. For example for the absolute loss $s_t^i=|x_t-y_t^i|$
with observation $x_t\in[0,1]$ and predictions $y_t^i\in[0,1]$, a master algorithm predicting
deterministically $w_t\circ y_t\in[0,1]$ suffers absolute loss $|x_t-w_t\circ y_t|\le\sum_iw_t^i|x_t-y_t^i|=\ell_t$, and
hence has the same (or better) performance guarantees as FPL. In general, masters
can be chosen deterministic if prediction space $\mathcal{Y}$ and loss-function $\mathrm{Loss}(x,y)$ are
convex.



3 For practical realizations in case of infinite n, one must use finite subclasses of increasing size, 
compare [LW94]. 
8 Discussion and Open Problems



How does FPL compare with other expert advice algorithms? We briefly discuss 
four issues. 

Static bounds. Here the coefficient of the regret term $\sqrt{KL}$, referred to as the
leading constant in the sequel, is 2 for FPL (Theorem 5). It is thus a factor of $\sqrt{2}$
worse than the Hedge bound for arbitrary loss [FS97], which is sharp in some sense
[Vov95]. For special loss functions, the bounds can sometimes be improved, e.g. to
a leading constant of 1 in the static WM case with 0/1 loss [CB97].

Dynamic bounds. Not knowing the right learning rate in advance usually costs a
factor of $\sqrt{2}$. This is true for Hannan's algorithm [KV03] as well as in all our cases.
Also for binary prediction with uniform complexities and 0/1 loss, this result has
been established recently: [YEYS04] show a dynamic regret bound with leading
constant $\sqrt{2}(1+\varepsilon)$. Remarkably, the best dynamic bound for a WM variant proven
in [ACBG02] has a leading constant $2\sqrt{2}$, which matches ours. Considering the
difference in the static case, we therefore conjecture that a bound with leading
constant of 2 holds for a dynamic Hedge algorithm.

General weights. While there are several dynamic bounds for uniform weights,
the only result for non-uniform weights we know of is [Gen03, Cor.16], which gives
a dynamic bound for a p-norm algorithm for the absolute loss if the weights are
rapidly decaying. Our hierarchical FPL bound in Theorem 9 (b) generalizes it to
arbitrary weights and losses and strengthens it, since both asymptotic order and
leading constant are smaller. Also the FPL analysis gets more complicated for
general weights. We conjecture that the bounds $\propto\sqrt{Tk^i}$ and $\propto\sqrt{s^i_{1:T}k^i}$ also hold
without the hierarchy trick, probably by using expert dependent learning rate $\varepsilon_t^i$.

Comparison to Bayesian sequence prediction. We can also compare the
worst-case bounds for FPL obtained in this work to similar bounds for Bayesian sequence
prediction. Let $\{\nu_i\}$ be a class of probability distributions over sequences and assume
that the true sequence is sampled from $\mu\in\{\nu_i\}$ with complexity $k^\mu$ ($\sum_i2^{-k^{\nu_i}}\le 1$).
Then it is known that the Bayes-optimal predictor based on the $2^{-k^{\nu_i}}$-weighted
mixture of $\nu_i$'s has an expected total loss of at most $L^\mu+2\sqrt{L^\mu k^\mu}+2k^\mu$, where $L^\mu$ is
the expected total loss of the Bayes-optimal predictor based on $\mu$ [Hut03a, Thm.2].
Using FPL, we obtained the same bound except for the leading order constant, but
for any sequence independently of the assumption that it is generated by $\mu$. This
is another indication that a PEA bound with leading constant 2 could hold. See
[Hut03b, Sec.6.3] for a more detailed comparison of Bayes bounds with PEA bounds.